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(57) Abstract: A speaker verification system is provided for identifying whether an input portion of speech originated from a par- 
ticular speaker. A set of features is extracted from an input portion of speech provided by the speaker. A first scoring means (4) 
scores the set of features with a first stored model of mixture components derived from sets of features extracted from input portions 
of speech provided by a plurality of speakers. A second scoring means (12) scores the set of features with a second stored model of 
mixture components derived from sets of features extracted from input portions of speech provided by the speaker to be identified. 
The results are compared to determine whether the input portion of speech did originate from that particular speaker. The system 
provides that the first scoring means (4) scores the set of features with only part of the first stored model most likely to provide a 
good match to the set of features provided. 
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SPEAKER VERIFICATION 



This invention relates to a speaker verification 
system and in particular to a speaker verification system 
based on the principles proposed in our British patent 
application serial no. GB-A-224 8513 . 

Speaker verification is important in applications 
such as financial transactions which are carried out 
automatically by telephone. Some of the problems of 
speaker verification are reduced by forming what are known 
as Gaussian Mixture models {GMM) for a number of 
utterances using features of these utterances from a large 
number of speakers. These models are known as world 
models. In addition, for every person whose speech is to 
be recognised, a GMM is formed. These models are known as 
personal or speaker models and comprise mixture components 
with which input utterances will be processed. 

in speech verification, a person says an isolated or 
connected utterances and features from each of these 
utterances are extracted and into feature vectors. After 
this, the probabilities that these features vectors could 
have been generated for these words by the world model and 
by the personal model of that person are calculated and 
these probabilities are compared for the utterances. A 
decision on a verification for the speaker is then based 
25 on a poll of these comparisons. 

More particularly, a system such as this operates by 
cutting an incoming stream of speech data into short 
sections or frames to allow feature extraction. A front 
end process extracts a set of features from each frame, 
30 these features being a function of the input speech 

signal. These features are then stored as a vector. The 
feature vectors are then further used for comparison to a 
world and speaker model. 

During the verification process a sequence of feature 
35 vectors is processed with both the world and the speaker 
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model. Only the features of the speaker model which best 
match the world model are processed and therefore only a 
small number of mixture components need to be processed 
with a speaker model. Therefore, most of the processing is 
5 needed for computing the world model. 

For example, a GMM with 1024 mixture components would 
require each of these to be processed with an input vector 
and if a speaker model only had the best five mixture 
components then only these five would need to be processed 
10 with an input vector. Therefore, the world model 

processing needs considerably more of the computation ' in 
the verification system. 

The present invention seeks to reduce the processing 
of the mixture components from the world model thereby 
15 considerably reducing the computational overhead of the 
whole process. 

The invention is defined with more precision in the 
appended claims to which reference should now be made. 

A preferred embodiment of the invention will now be 
20 described in detail, by way of example, with reference to 
the accompanying drawings in which: 

FIGURE 1 shows a block diagram of a system embodying 
the invention; 

FIGURE 2 shows a block diagram of a basic world model 

25 scoring system; and 

FIGURE 3 shows a component predicted world model 
scoring system in accordance with an embodiment of the 
invention; and 

FIGURE 4 shows schematically how component prediction 

30 is performed. 

The diagram of Figure 1 shows an input speech signal 
to a front end processor 2 which produces as its output a 
feature vector. This is achieved as described above by 
cutting the speech signal into frames and from each frame 

35 extracting a set of features which are then combined into 
a feature vector for that frame. 
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The next stage in the process is that the feature 
vectors are provided to a world model storing unit 4 . 
This also receives, as an input , mixture components from a 
world GMM which comprises mixture components for all 
5 possible speakers. The scoring process with the world 
model leads to a ranking of the mixture components 
according to the likelihood score for the given feature 
vector. It will be appreciated that the score for the 
comparison of each feature vector with each mixture 
10 component can only give a likelihood as there will be 
small variations in input speech every time a speaker- 
provides an input signal. These variations will also 
occur as a result of the frames from which the feature 
vectors are extracted having cut points at different times 
15 in the speaker's speech each time speech is analysed. 

Therefore, there will be no exact matches between feature 
vectors and mixture components from the world model. All 
that will be produced is a likelihood score for the given 
feature vector corresponding to one of the world model 
20 mixture components . 

After this likelihood scoring process has taken 
place, each feature vector is assigned to the best scoring 
mixture component of the world model and these are output 
from the world model score unit 4. The assigned feature 
25 vectors are accumulated in a temporary GMM model 8. This 
temporary model is used for a speaker adaptation process 
in conjunction with the world model to create a speaker 
model 10. The intention is to produce a speaker model 
which is a statistical representation of the speaker's 
30 speech, where each of the speaker' s mixture components has 
exactly one corresponding component in the world model. 
This speaker model can then be used in a speaker model 
scoring unit 12. 

A fast convergence of the speaker model parameter is 
35 one of the most important tasks in the system and is 

performed by the speaker adaptation unit 14. The speaker 
model parameters should change almost immediately to the 
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input characteristic during the initialisation period. 
After a certain time, when enough speaker data are 
collected, the system should change from speaker 
adaptation to speaker tracking. 
5 The tracking allows the system to follow changes in 

the speaker's voice pattern over a longer period of time. 
The tracking should be slow to allow capturing the voice 
over a long time span and not only over the last few 
utterances. This should lead to a more robust estimation 
10 of the speaker's model parameters. 

The speaker adaptation unit 14 performs operations 
based on the standard equation for on-line model re- 
adaptation. In the case of background pre-whitened 
feature vectors, the equation is: 

n m±8 

v v -> * v ' 

=/ -i-r 

15 where £ S/ ,c is an already estimated speaker model parameter 
from the speaker model 10 which represents m seen frames, 
jj. St ,c is the new mean accumulated over the last segment in 
the temporary model 8 with the representative weight of n. 
The new re-adapted model parameter % Si .t represents n+m 

20 frames and is calculated with weights, according to n and 

m, for^.r and /u St ,: respectively. The numeral 8 is only a 
preferred value, and the values may be appropriate in 
other circumstances . 

A problem of the accumulation is the memory usage for 

25 each parameter. ^ s ,.? can be stored with 16 bit resolution 

whereas ^u s ,, e is stored with only 8 bit. The averaging is 
not very accurate if the resolution for M s ," is reduced to 
8 bit. Therefore the storage of the temporary model 8 is 
twice as large as the speaker model 10. 
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A simple way of reducing the memory size of the 
temporary model is to store only a sub-set of all mixture 
components in the temporary model. This enables the 
memory to be reduced to a half of the original size. The 
5 components are chosen by the frequency of their 

occurrence. Components with a high frequency are kept in 
memory in the temporary model. If a frame is seen for a 
component which is not in the memory, the component with 
the lowest frequency in the memory is checked and possibly 
10 exchanged with the new component from the world model 6. 

The disadvantage of the temporary model reduction is 
the loss of information. Therefore the speaker model 
estimation will take longer. 

Speaker tracking uses the same equation as in 
15 adaptation. The weighting factor y is set to a certain 

value instead of being calculated individually according 
to the number of seen frames. 



The adaptation will change to a tracking approach if 
enough frames have been received to train the mixture 

20 component parameters for the target speaker. In this 

system, this is the case when 255 frames have been 
processed for a certain mixture component, but other 
values could be selected. 

There are several ways to choose y- The value should 

25 not be too large to avoid a fast tracking which weights 

the newer data very high and therefore loses information 
from the past quickly. If y is chosen too low, the 
target speaker' s change might not be captured and the 
speaker is locked out by the system. The value for y also 

30 depends on the model size and the length of the test 

segment . 

To validate performance of the system, first a test 
with standard adaptation is performed- This reveals the 
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baseline performance for further testing of memory 
reductions on the temporary model and the speaker tracking 
settings. Tests are performed with a model size of 64 
mixture components. Larger model sizes might not obtain 
5 any changes between the different tracking strategies due 

to the small amount of training sessions available. A 
comparison is also made for gender dependent background 
models and a combined gender model for the adaptation 
process . 

10 Once the mixture components have been inserted in the 

speaker model, the feature vectors and mixture components 
output by the world model scoring unit are provided to the 
speaker model scoring unit 12. Producing a likelihood 
score for the speaker model involves only the processing 

15 of the most likely mixture components with the feature 

vectors. These components also retain high scores from 
the speaker model due to the component correspondence 
between speaker and world model. Therefore, only a small 
number of components are processed for scoring the speaker 

20 model - 

The output scores of the world and speaker model 
scoring units are likelihood scores which are input to a 
normalisation and decision unit 16. The two world and 
speaker model scores are normalised by subtracting the 

25 world model score. This is compared to a threshold and 
the speaker is accepted or rejected by the system in 
dependence on the difference signal or falling above or 
below the threshold. An accept or reject signal is then 
output by the normalisation and decision unit 16. 

30 A process known as component prediction can be used 

to speed up the processing of the world model scoring. It 
will be appreciated that in the diagram described above, 
each of the input feature vectors from the front end 
processor 2 has to be compared with each of the mixture 

35 components from the world model 6 in the world model 

scoring unit 4. This task is therefore computationally 
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very expensive both in initial training of the unit for a 
speaker and for subsequent testing. 

To understand how component prediction works, the 
standard world model scoring system is first explained in 
5 more detail without component prediction. This is done 

with reference to Figure 2 which shows the task of world 
model scoring. 

The world model, which is a GMM, consists of a number 
of mixture components 20. 

10 In the scoring process, each of the mixture 

components 20 in the world model is processed with an • 
input feature vector in a scoring unit 22 . The result of 
this scoring is a likelihood score for each of the mixture 
components. The likelihood of the scores for all of the 

15 components are sorted according to their values to produce 
a likelihood ranking of the mixture components. The best 
scoring components recognised by their indices are used 
for further processing and are stored for each feature 
vector in a best scoring component store 24. 

20 The other output from the scoring unit 22 is the 

world score. For each feature vector, this is calculated 
by combining the likelihood scores of the best scoring 
component . 

In Figure 3, a component predicted world model 
25 scoring system is shown which illustrates the extension of 
the world model scoring using component prediction. 

In this, the indices of the best scoring components 
stored at 24 are used to choose indices from a look-up 
table index 26. These indices point to information about 
30 which mixture components are most likely to obtain high 

likelihood scores for the following feature vector. Thus, 
these contain data about which mixture components should 
be selected for comparison with the next input feature 
vector-. Only a subset of all the components is selected, 
35 eg. 5 components according to the data from the look-up 

table. This component selection is performed by component 
selection unit 2 8 before being provided to the scoring 



WO 02/067245 PCT/GB02/00665 



- 8 - 

unit 22 for scoring with the next feature vector. The 
scoring of this next feature vector again leads to best 
scoring components and the indices of these components are 
again used for a prediction for the following feature 
5 vector. Thus, only a small number of mixture components 
from the world model are processed with each feature 
vector thereby considerably reducing the computational 
overhead- The saving will depend on the exact number of 
mixture components selected. 
io The look-up tables which are referenced by the 

indices of best scoring components are now described in 
more detail. 

Given the sequence of feature vectors, which is a 
known sequence, the idea is to predict certain mixture 

15 components from the world model which are most likely to 
achieve high scores for processing of the immediately 
succeeding vector. This prediction is based on a data 
driven estimation of the most likely component indices . 
For example, a total of 25 components might be predicted 

20 from a total of 1124 mixture components in a world model 
thereby reducing the processing time by over 95%. 

The prediction is derived from transition 
probabilities for transversing from a mixture component J 
to a mixture component I in acoustic space . 

25 The Gaussian component of a world GMM can be trained 

using the EM-algorithm published in the Statistical 
Society, 39:1-38,1997 by a Dempster, A., Laird, N., and D. 
Rubin under the heading "Maximum likelihood from 
incomplete data via the EM algorithm". This algorithm 

30 assumes equal probabilities for all state transitions. 

However, transition probabilities can be calculated after 
training of the mixture components. These transition 
probabilities allow prediction of certain mixture 
components for the processing of the succeeding vector. 

35 Figure 4 shows an overview of the prediction scheme. 

A feature vector is processed using the world GMM. A 
likelihood score of the feature vector is calculated to 
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each mixture component (Stage 1) . These scores are sorted. 
The look-up tables of the most likely mixture components 
are used for prediction (stage 2) - The component indices 
of these tables are copied into a component prediction 
5 array for the processing of the next feature vector. 

Several parameters can be varied in the prediction 
process. These are the number of mixture components used 
for prediction and the size of the look-up tables. 
Another aspect is the processing of the first feature 

10 vector in a vector sequence. The prediction may not be 

used for the first frame which is similar to a calculation 
of all components in the GMM. 

Most of the times the prediction produces the same 
substantially same result as full processing. Sometimes 

15 the prediction deteriorates with the frame likelihood 

score but the prediction does not degenerate into a random 
component calculation. The initial component prediction 
obtains good results when used for the first frame of a 
speech segment. This is only in general, the differences 

20 between the two initialisation methods are minor at the 
start of a speech segment. 

Global transition estimates can be averaged over all 
consecutive vector pairs. An initial transition 
probability for the start of a vector sequence can be 

25 defined for a number of extracted speech segments. 

The transition probabilities P(I:J) are sorted for 
each coiqponent J and only the most likely indices of I are 
stored ia the look up table 26. The table for each mixture 
component can vary, in size. When a feature vector is 

30 processed, with the world GMM it leads to a most likely 
mixture component. This is stored in the indices of 
aposteirori components 24 and is used to select the look 
up table to be used by the component selection unit 28. 
The table contains the indices of the mixture components 

35 which will be processed for the next feature vector, these 
mixture components being the most likely mixture 
components to follow the current feature vector. 
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Returning now to Fig. 1, it will be explained how 
component prediction improves the performance of the 
system of Figure 1 for both the initial training of the 
speaker model and for subsequent testing of a speaker. 
5 When a new speaker model is to be created for a new 

user this is done by the speaker saying a known sequence 
of words a certain number of times and each utterance of 
the word being used to generate the most likely components 
from the world model which correspond to that speaker. 

10 This is done by first using the world model scoring unit 4 
with extracted feature vectors. Initially, the first ■ 
feature vector is scored with all the components of the 
world model 6 and an index of the best scoring components 
with this stored at 24. The look up table store 26 then 

15 provides data corresponding to the most likely set of 

components which should be compared with the next feature 
vector. This most likely set is the most likely set of 
the world components not of any particular speaker's 
components . These are then scored with the next input 

20 vector and the process repeats. 

At the same time, the temporary model 8 receives the 
feature vectors output by the world model scoring unit 4 
and a speaker adaptation unit 14 uses this to produce a 
speaker model 10 for that particular speaker. Thus, a 

25 speaker model is created and is stored for future 
reference . 

In testing, a speech input is tested against a 
speaker model. This can be done by a user who is to be 
identified first inputting, for example, a first 

30 identification number or some other identifier. This 

causes the system to load what it believes to be a speaker 
model for that speaker into the speaker model store 10. An 
utterance of speech from the speaker is then processed by 
the front end processor and the world model scoring unit 

35 which operates according to the system Figure 3. That is 
to say, it first scores the first input vector with all 
the components from the world model 6 before using the 
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index of best scoring components 24 and look up table 
store 26 to select by component selection unit 28 the most 
likely set of components for scoring against the next 
vector. 

5 Feature vector mixture indices are supplied to the 

speaker model scoring which operates only with the mixture 
components in the speaker model 10. Thus, the processing 
of vectors is significantly reduced and thereby the 
computational overhead. It will be appreciated that in 
10 training the full potential of the speed up in computation 
is achieved due to the processing of just one speaker- 
model . 
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1. A speech verification system for identifying whether 
5 an input portion of speech originated from a particular 

speaker comprising: 

means to extract a set of features from an input 
portion of speech provided by the speakers- 
first means for scoring the set of features with a 
10 first stored model of mixture components derived from 

sets of features extracted from input portions of speech 
provided by a plurality of speakers; 

second means for scoring the set of features with a 
second stored model of mixture components derived from 
15 sets of features extracted from input portions of speech 
provided by the speaker to be identified; 

means for comparing results provided by the first and 
second scoring to determine whether the input portion of 
speech did originate from the said particular speaker; 
20 characterised in that 

the first scoring means scores the set of features 
with only the part of the first stored model most likely 
to provide a good match to the set of features. 

2. A speech verification system according to claim 1 in 
25 which the first scoring means includes indices of best 

scoring mixture components for sets of features, a look up 
table containing data identifying the portions of the 
first stored model most likely to provide a good score 
with the next set of features, and means for selecting the 
30 said portions of the first stored model for scoring with 
the next- set of features. 

3. A speech verification system according to claims 1 
and 2 in which the extracting means extracts a plurality 
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of sets of features from the input portion of speech and 
each set of features comprises feature vectors derived 
from the sequence of sounds in an interval. 

4. A system according to claim 1 in which the system is 
5 arranged to initialise to a new speaker during an 

initialisation period, wherein the second means for 
scoring the set of features does this a predetermined 
number of times for the new speaker scores them with an 
estimated set of mixture components in the second stored 
10 model and modifies the estimated set of mixture components 
after each iteration of the scoring means. 

5. A system according to claim 1 or 4 in which the 
second stored model is modified if the comparison means 
indicates that the input speech came from a particular 

15 speaker and the stored model is modified in dependence on 
the result of the scoring of the set of features with the 
second stored model. 

6. A system according to claim 5 in which a weighting 
factor is used to modify the second stored model with the 

20 set of features. 

7. A method for speech verification for identifying 
whether an input portion of speech originated from a 
particular speaker comprising the steps of: 

extracting a set of features from an input portion of 

25 speech provided by the speaker; 

scoring the set of features with a first stored model 
of mixture components derived from sets of features 
extracted for input portions of speech provided by a 
plurality of speakers; 

30 second means for scoring the set of features with a 

second stored model of mixture components derived from 
sets of features extracted from input portions of speech 
provided by the speaker to be identified; 
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comparing the results provided by the respective 
scoring steps to determine whether the input portion of 
speech do. originate from the speaker to be identified; 

characterised in that the scoring of the set of 
5 features with the first stored model of mixture components 

comprises scoring the features only with the part of the 
first scored model most likely to provide a good match to 
the input set of features. 

8 . A method according to claim 4 in which the part of 
10 the first stored model most likely to provide a good match 

to the set of features comprises looking up previously 
stored data corresponding to the most likely set of 
components which should be compared with the next feature 
vector from the first scored model and scoring this set of 
15 mixture components with the next set of features. 

9. A method according to claim 7 including the step of 
initialising to a new speaker during an initialisation 
period in which the scoring of the set of features with 
the second stored model is repeated a predetermined number 

20 of times commencing with an estimated set of mixture 

components in the second stored model, and modifying the 
estimated set of mixture components with each iteration. 

10. A method according to claim 7 or 9 including the step 
of modifying the second stored model if the company step 

25 indicates that the input speech come from a particular 

speaker in dependence on the result of the scoring of the 
set of features with the second stored model. 

11. A method according to claim 10 in which a weighting 
factor is used in the modifying step. 
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